Set up:

If you don’t have the data from last time handy, or have forgotten where you put it, use the instructions from the last session to re-download it. Then set your working directory to where the data are:

setwd("~/r4grads/Fish_data/Modified/")

We’ll need our all_data object again. If you have your "fish_data_merged.csv" file that we wrote out last time, you can read in:

all_data <- read.csv("fish_data_merged.csv")

If you’ve lost track of this file, we can create it again:

# read in each of the data files
body <- read.csv("Fish_body_size.csv")
iso <- read.csv("Fish_isotopes.csv")
# Make all of the subsitutions that we made last time
body$Species <- gsub("Steelhesd", "Steelhead", body$Species)
body$Species <- gsub("COHO", "Coho", body$Species)
body$Species <- gsub("^Dolly$", "Dolly varden", body$Species)
body$Site <- gsub("RT02-R", "RT02R", body$Site)
body$Site <- gsub("RT02-BP", "RT02BP", body$Site)
# Merge the data
all_data <- merge(body, iso)



Tidyverse


So far, we’ve done all of the manipulation using base R functions. We can also do a lot of fancy data manipulation using the Tidyverse set of R packages, which all share some general principles in their architecture. I won’t go into the details of the Tidyverse philosphy, as it’s very well documented and you can read all about it at the link.


To start, you’ll need to make sure you have the Tidyverse packages installed and then loaded:

install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors


Before we actually start doing anything, there’s an important issue to point out in this loading message. The conflicts show that there are functions in dplyr and the base R stats packages that share the same function names. Whenever this happens, the function from the most recently loaded package will mask the other function. If you load dplyr last and then run filter(), what you’ll get is the function from dplyr. Alternately, if you load stats last, you’ll get the filter() function from that package.

This is important to keep track of. Especially if you write a script and then edit it to load up a package at the the top of a script. You can always call a function from a specific package by using the notation package::function(). The double colons tell R to explicitly use a function from the stated package before the colons.


Some of these functions are very similar to base R functions. E.g., str_replace() is very similar to gsub(), but with a different order of arguments:

body <- read.csv("Fish_body_size.csv")
body$Species <- str_replace(body$Species, "Steelhesd", "Steelhead")


dplyr and tidyr for filtering and summarizing


We can filter data using the filter() function, this is similar to how we filtered data using base R:

coho_data <- filter(all_data, Species == "Coho")
coho_data
big_coho <- filter(all_data, Species == "Coho" & Fork.length..cm. > 10)
big_coho

We can select only certain columns using select():

big_coho_size_vars <- select(big_coho, Weight..g., Fork.length..cm.)
big_coho_size_vars

We can add in columns that are combinations or transformations of others:

big_coho2 <- mutate(
  big_coho,
  mass2 = Weight..g. * 2,
  mass2_squared = mass2 ^ 2
)
big_coho2


We can also get useful summaries of our data grouped however we want. E.g., we can get the mean weight of each species from our all_data object:

by_spec <- group_by(all_data, Species)
summarise(by_spec, size = mean(Weight..g., na.rm = TRUE))
## # A tibble: 7 × 2
##   Species                  size
##   <chr>                   <dbl>
## 1 Chum                    0.32 
## 2 Coho                    3.38 
## 3 Dolly varden            9.66 
## 4 Pink                    0.147
## 5 Steelhead               7.29 
## 6 Three spine stickleback 1.32 
## 7 coastrange sculpin      7.33


Tidyverse has all sorts of functions like these that each do specific data manipulations that you can explore at your leisure. Many people love Tidyverse and use it for all of their data manipulation, others prefer to use base R for basic manipulations and use Tidyverse only for more complex operations. It’s totally up to you which you prefer and which makes more sense to you.


Tibbles

On top of all sorts of functions, Tidyverse also introduces a couple other important pieces of functionality: tibbles and pipes.

Tibbles are effectively an extension of the dataframe format that has very specific rules. You can read all about them here.

You can convert dataframes to tibbles like so:

big_coho_tib <- as_tibble(big_coho)
big_coho_tib
## # A tibble: 9 × 10
##   Fish.code Species Site   Date    Weigh…¹ Fork.…² del13C Std..…³ del15N Std..…⁴
##   <chr>     <chr>   <chr>  <chr>     <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
## 1 C01       Coho    RK17   11/10/…    13.2    10.2  -23.1    0     13.6     0.04
## 2 C117      Coho    RT03   5/19/93     8.9    10.2  -28.7    0.01  10.2     0.09
## 3 C128      Coho    RT02R  5/16/93     7.9    10.8  -22.2    0.1   13.6     0.02
## 4 C133      Coho    RP03BP 11/12/…    14.8    11    -33.7    0.01   9.69    0.09
## 5 C134      Coho    RP03BP 11/12/…    14.2    11.4  -32.9    0.01   9.03    0.08
## 6 C143      Coho    RT02BP 11/7/93    18.5    11.8  -22.6    0.13  15       0.11
## 7 C144      Coho    RT02BP 11/7/93    14.2    10.8  -22.6    0.01  15.2     0.12
## 8 C160      Coho    RT02   4/19/93    11.6    10.7  -21.4    0.09  13.9     0.1 
## 9 C77       Coho    RK14   4/4/93     17.4    11.5  -21.4    0.13  14.9     0.06
## # … with abbreviated variable names ¹​Weight..g., ²​Fork.length..cm.,
## #   ³​Std..error, ⁴​Std..error.1

Notice that this prints differently from a dataframe. There are some other key differences, including that Tibbles do not have row names (instead this information is expected to be as a column within the tibble). Use whichever you prefer. In many cases, they operate pretty much interchangeably, although some functions from various packages will require input data to be in one of the formats or the other – it is easy to convert back and forth, though.


You can also read data into R straight into a tibble using Tidyverse functions:

body_tib <- read_csv("Fish_body_size.csv")
## Rows: 413 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Fish code, Species, Site, Date
## dbl (2): Weight (g), Fork length (cm)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
body_tib
## # A tibble: 413 × 6
##    `Fish code` Species Site  Date     `Weight (g)` `Fork length (cm)`
##    <chr>       <chr>   <chr> <chr>           <dbl>              <dbl>
##  1 C01         Coho    RK17  11/10/92         13.2               10.2
##  2 C02         Coho    RK17  11/10/92          5.8                7.9
##  3 C03         Coho    RT02  11/18/92          8.6                8.9
##  4 C04         Coho    RT02  11/18/92         11.8                9.8
##  5 C05         Coho    RT02  11/18/92          5                  7.7
##  6 C06         Coho    RT02  11/18/92          5.3                8.1
##  7 C07         Coho    RK17  11/9/92           1.4                4.8
##  8 C08         Coho    RK17  11/9/92           1                  4.6
##  9 C09         Coho    RK13  11/13/92          2.6                5.9
## 10 C100        Coho    RK09  6/29/93           3.6                6.9
## # … with 403 more rows

Note that is almost the same as the base read.csv() function, just with an underscore instead of a period. You can also read in tab delimited data using read_tsv() and there are probably Tidyverse equivalents of most functions to read in data.



Pipes


Pipes allow you to chain (or “pipe”) together multiple functions into a single long command. If you’re familiar with the Linux/Unix command line, this is done with the | character. R uses that for some other purposes, so Tidyverse instead introduces the pipe as %>%.

When you pipe a function to another function, the output from the first function becomes the data argument of the second function, and you can link many pipes together.


For example, we can pipe together all of the commands we used above to make the big_coho2 object:

big_coho3 <- filter(all_data, Species == "Coho" & Fork.length..cm. > 10) %>%
  select(Weight..g., Fork.length..cm.) %>%
  mutate(
    mass2 = Weight..g. * 2,
    mass2_squared = mass2 ^ 2
  )
big_coho3
##   Weight..g. Fork.length..cm. mass2 mass2_squared
## 1       13.2             10.2  26.4        696.96
## 2        8.9             10.2  17.8        316.84
## 3        7.9             10.8  15.8        249.64
## 4       14.8             11.0  29.6        876.16
## 5       14.2             11.4  28.4        806.56
## 6       18.5             11.8  37.0       1369.00
## 7       14.2             10.8  28.4        806.56
## 8       11.6             10.7  23.2        538.24
## 9       17.4             11.5  34.8       1211.04


Try out piping together group_by() and summarise() to get an object that has the mean and standard deviation of the weight of each species from the all_data object. You’ll need to look up what functions calculate means and standard deviations.






















weight_sum <- group_by(all_data, Species) %>%
  summarise(size = mean(Weight..g., na.rm = TRUE), stdev = sd(Weight..g., na.rm = TRUE))

weight_sum


We could also add in some other columns using mutate:

weight_sum2 <- group_by(all_data, Species) %>%
  summarise(
    size = mean(Weight..g., na.rm = TRUE), 
    stdev = sd(Weight..g., na.rm = TRUE)
    ) %>%
  mutate(
    lower = size - stdev,
    upper = size + stdev
    )

weight_sum2
## # A tibble: 7 × 5
##   Species                  size   stdev   lower  upper
##   <chr>                   <dbl>   <dbl>   <dbl>  <dbl>
## 1 Chum                    0.32   0.0632  0.257   0.383
## 2 Coho                    3.38   3.35    0.0339  6.73 
## 3 Dolly varden            9.66  10.2    -0.519  19.8  
## 4 Pink                    0.147  0.0516  0.0950  0.198
## 5 Steelhead               7.29   5.95    1.34   13.2  
## 6 Three spine stickleback 1.32   0.545   0.770   1.86 
## 7 coastrange sculpin      7.33   3.86    3.46   11.2


Here, I’ve split up single functions onto multiple lines to improve readability: each column being created by summarise() and mutate() gets its own line.


There are advantages and disadvantages to piping together commands. A major advantage is that you don’t need to store intermediate data objects that you aren’t going to use as objects in R’s memory. A disadvantage is that if you pipe together too many commands, your code can start to become less readable.



As already mentioned, we’ve really only scarped the surface of how to manipulate data with either base R or Tidyverse. There is a whole book R for Data Science written by the primary developers (or at least originators) of Tidyverse. That link will take you to the site for the free book, which is filled with code examples.









Exploratory data visualization

So far, we’ve explored how we can manipulate our data in R, including reading in, subsetting, and merging datasets. These are all very important, but looking at data in the ways we have so far doesn’t give us a good intuition for what kind of patterns or distributions there are in the data. Do we have mostly large fish? Mostly small fish? Are the data roughly normally distributed? If you have far better mathematical intuition than I do, you might be able to guess at this from looking at a dataframe in the R console, but I sure can’t. Plotting the data gives us a much better idea of what’s going on.

We’ll start with a few basic plots using base R functionality, then move onto fancier graphics in ggplot.


Relationships among continuous variables

When collecting data, we likely have some hypotheses that we are testing or at least general ideas about what kinds of patterns we expect in the data. In many cases, we may hypothesize that two variables are correlated. We can visualize the association between two variables using a scatterpot.

Let’s quickly remind ourselves of the variables that are in our fish dataset:

colnames(all_data)
##  [1] "Fish.code"        "Species"          "Site"             "Date"            
##  [5] "Weight..g."       "Fork.length..cm." "del13C"           "Std..error"      
##  [9] "del15N"           "Std..error.1"

We would probably expect a relationship between weight and fork length, since it makes sense that longer fish are probably heavier fish. Let’s see what it looks like:

plot(all_data$Fork.length..cm., all_data$Weight..g.)


Cool, looks like there’s a relationship here, but it’s pretty hideous, so let’s clean it up a little.

plot(all_data$Fork.length..cm., all_data$Weight..g.,
     pch = 16, frame = FALSE, col = "blue", cex=0.5,
     ylab="Weight (g)", xlab = "Fork length (cm)")

There are a number of other graphical parameters you can tweak as you like. There are good resources all over the internet that can be easily found by searching things like “base R axis labels”.


This plot gives us a pretty good idea that there’s a correlation between our data, but it looks distinctly non-linear. This isn’t surprising, since larger fish don’t just get longer, but also wider and taller, so we’d expect the mass to increase more than linearly. This is a good example of when we might want to transform our data.

Try cube root transforming weight and then plot that against fork length. We’re using the cube root because mass typically increases cubically as a function of length (assuming that other dimensions stay roughly proportional). Remember that a cube root is the same as raising a number to the power of 1/3.

















cube_weight <- all_data$Weight..g.^(1/3)
plot(all_data$Fork.length..cm., cube_weight,
     pch = 16, frame = FALSE, col = "blue", cex=0.5,
     ylab="Cube root weight (g)", xlab = "Fork length (cm)")

## OR 
all_dat_wcube <- mutate(all_data, cubeweight = Weight..g.^(1/3))
plot(all_dat_wcube$Fork.length..cm., all_dat_wcube$cubeweight,
     pch = 16, frame = FALSE, col = "blue", cex=0.5,
     ylab="Cube root weight (g)", xlab = "Fork length (cm)")


This looks very linear now. Let’s use an actual statistical test to see if there is a correlation:

cor.test(all_data$Fork.length..cm., cube_weight)
## 
##  Pearson's product-moment correlation
## 
## data:  all_data$Fork.length..cm. and cube_weight
## t = 96.158, df = 407, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9741752 0.9824260
## sample estimates:
##       cor 
## 0.9786921

We have a very high correlation coefficient and a very low p value, cool. Remember, however, that we cube root transformed the weight data, so the interpretation is not that there is a linear relationship between weight and length. Rather there is a cubic relationship.



Since we have a hypothesis of causation (being larger causes a fish to be heavier), a linear regression is probably more appropriate here than a correlation test. This also allows us to plot out the regression line on our scatterplot:

reg <- lm(all_dat_wcube$cubeweight ~ all_dat_wcube$Fork.length..cm.)
plot(all_dat_wcube$Fork.length..cm., all_dat_wcube$cubeweight,
     pch = 16, frame = FALSE, col = "blue", cex=0.5,
     ylab="Cube root weight (g)", xlab = "Fork length (cm)")
abline(reg, col="black")

Note that here, the order of weight and length in the regression formula is important because we are saying that weight is a function of length (i.e., that length is the predictor/independent variable), not simply asking if there is a correlation.



Differences among groups

Beyond correlations among continuous variables, in many cases we are interested in differences in measurements among groups. Here we have several different species, and we expect that our measurements will differ across these. We’ll take the same general approach as above, starting with some simple visualization, then explicitly testing for differences among groups.

We can start by plotting out a distribution for a single species by filtering our data to a single species data and making a histogram.

coho_weight <- all_data[all_data$Species=="Coho", "Weight..g."]
hist(coho_weight, breaks = 30, xlab = "Coho weight", main = NULL)

We can also plot multipe histograms together in different colors to get a sense for how distributions do or do not overlap.

dolly_weight <- all_data[all_data$Species=="Dolly varden", "Weight..g."]
hist(coho_weight, breaks = 20, xlab = "weight", main = NULL, xlim = c(0,70),
     col="blue")
hist(dolly_weight, breaks = 60, add = TRUE, col = "red")


A better way to plot this would be to make the histograms partially transparent. We won’t explore that here, but as for most things in R, there are plenty of resources on the internet: https://stackoverflow.com/questions/3541713/how-to-plot-two-histograms-together-in-r

Even with some transparency, if we plotted all seven of our species, this would get hard to interpret pretty quickly. One option for plotting multiple distributions is to use box plots. Personally, I prefer violin plots from ggplot, but we’ll get to those later on.

par(mar=c(10,4.1,4.1,2.1))
boxplot(Weight..g. ~ Species, data = all_data, main="Fish weights",
   xlab = NULL, ylab = "Weight (g)", las = 2)

This looks ok. There is a lot more customization we could do with this, but again, we won’t get into that here. The par() function sets graphical parameters, and we used it to set the margins mar() so that the bottom margin is wider than the default to accommodate the vertical labels. You can find more info here.


Let’s write this plot out to pdf. You can save plots directly from RStudio using some GUI options, but the quality is not as good as plotting to a pdf or other image format using pdf(), png(), etc. I personally prefer pdf format because it is vector-based and maintains resolution under infinite zoom. We use the function pdf() to open a pdf plotting device, then plot like normal, but now everything being plotted goes to the pdf file that is open for writing. When we’re done, we use dev.off() to turn off and close the active plotting device, leaving us with our completed pdf file.

pdf(file="boplot.pdf", width=6, height=5)
par(mar=c(10,4.1,4.1,2.1))
boxplot(Weight..g. ~ Species, data = all_data, main="Fish weights",
   xlab = NULL, ylab = "Weight (g)", las = 2)
dev.off()


We can see here that it looks like we have some major differences in body sizes across the different fish species. Let’s test that out using ANOVA.

There are various ways to run ANOVAs in R, one is using the aov() function. This function is structured similarly to the lm() and boxplot() functions. Try using ?aov() to figure out how to run an ANOVA that will tell us if body weights differe among species. You might need to search the internet for how to specify formulas in R.

































res.aov <- aov(Weight..g. ~ Species, data = all_data)
summary(res.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       6   3983   663.8   20.15 <2e-16 ***
## Residuals   402  13245    32.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Looks like we have some some highly significant differences among the species here. ANOVA itself won’t tell us what is different from what, though, so let’s run a quick Tukey post-hoc test.

TukeyHSD(res.aov)


Keep in mind that ANOVA carries assumptions like equal variance among groups and normality that may be violated in this data. When running any test with such assumptions, you should be eplicitly testing that your data conform to them. If your data do not meet the test’s assumptions, your results may be biased, and so data transformation or the use of a non-parametric test that does not carry the same assumptions may be a better alternative. You can find more detail on testing ANOVA assumptions and ANOVA alternatives in R here: http://www.sthda.com/english/wiki/one-way-anova-test-in-r

All statistical tests make certain assumptions about the data, and tests vary in how robust they are to violations of those assumptions. In the end, you will need to know the specifics of both your dataset and the relevant statistical tests to know what methods will be applicable to your data.



ggplot


ggplot is an extremely popular R package that can make all sorts of very nice graphics. If you get familiar with the general style of ggplot, you’ll pretty quickly start to notice that many figures in research papers are made with this package.


ggplot syntax

ggplot makes some very nice figures, but it has a whole syntax to it that can take a little while to learn. We’ll make a quick plot, then explain what’s going on here:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g.))


All calls to ggplot() are composed of at least two pieces. The first simply specifies the object that contains the data. All of the data for the plot should be in a single object, with different variables in different columns and each row specifying a single observation of a data (this is part of the general “Tidy” data philosophy).

The basic ggplot() call without adding in a geom function will just render a blank plot, since you haven’t told it what kind of plot to make with the data - this is unlike the base R plot() function we used above, which will try to guess what type of plot you want based on the nature of the data.

ggplot(data = all_data) # this makes an empty plot


In the full call:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g.))

The geom_point() function tells ggplot to plot out the data as points. Within this function, the mapping argument specifies how the data are mapped to the visualization. The mapping is specified using the aes() (aesthetic) function. Here we specify only which variable is x and which is y, but there are other things we can specify as well.

  • Multiple geom functions can be combined into a single plot, and the mapping argument can be specified independently in each, or can be specified globally within the ggplot() function call.


Aesthetic mappings

We’ve so far plotted out two variables, but we can add information about additional variables by passing additional arguments to aes(). For example, we can change the shape the points by species:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g., shape = Species))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 33 rows containing missing values (`geom_point()`).

  • Note the warning that we can only assign up to 6 shapes to variables, and see that three spine stickleback has been left out because of this.


Instead we can change the color by species:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g., color = Species))


We can also scale the size of the points by some variable using the size argument in aes(). Try modifying the above plot so that the points are sized by del15N.

































ggplot(data = all_data) + 
  geom_point(mapping = aes(
    x = Fork.length..cm., 
    y = Weight..g., 
    color = Species,
    size = del15N))

This is a very busy plot at this point, and for me, all points are too large to be interpretable, so let’s add a function that controls the range of point sizes:

ggplot(data = all_data) + 
  geom_point(mapping = aes(
    x = Fork.length..cm., 
    y = Weight..g., 
    color = Species,
    size = del15N)) +
  scale_size(range = c(0.1, 2))

This is a little better, but still is clearly not the best way to display these data. Just because you can plot things in a certain way, doesn’t mean you should plot them that way.


If we want to change the color (or size, shape, etc.) of all points, rather than according to a variable, we can do this by pulling that aesthetic outside of the aes() function and setting it manually:

ggplot(data = all_data) + 
  geom_point(mapping = aes(x = Fork.length..cm., y = Weight..g.), color = "blue")



Geometric objects and facets

There are many other types of geoms, or geometric objects, that we can plot. We won’t come anywhere close to exploring the full set that are available, but we’ll demonstrate a few others.

Here is a smoothed line plot of length vs. del15N:

ggplot(data = all_data) + 
  geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


We can show the different species by setting linetype:

ggplot(data = all_data) + 
  geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N, linetype = Species))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 3.2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 3.2
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 0.1
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0

This does not look good at all and throws some warnings These are caused because at least one of our species has very few observations, and this causes issues with the method used to create the smoothed line.

  • Warnings vs. errors: An error in R means that something has gone wrong such that the function you tried to run could not be successfully executed. Warnings are potentially less severe. It means that some potetentially undesirable behavior has occurred, but that the function is still able to run and return output. However, a warning could very well mean that your function has returned bad results, and you should always investigate what caused a warning and if the output is acceptable or not.


In this case, it might be useful to look at each species plotted individually. However, I don’t want to go through and individually code each plot. Instead, let’s use the facet functionality of ggplot:

ggplot(data = all_data) + 
  geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N)) +
  facet_wrap(~ Species)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 3.2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 3.2
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 0.1
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0

We can start to see where the problem might be, but the x and y scales for these different species really aren’t comparable, so let’s allow those to change freely:

ggplot(data = all_data) + 
  geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N)) +
  facet_wrap(~ Species, scales = "free")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

As is the case with most things in ggplot and R more generally, there are many ways to customize facets, and we won’t explore those here.

  • Which of these species looks problematic? Try making a single plot using geom_smooth() including linetype = Species as above, but remove the offending species.

































I removed only “Pink” - you might have decided to also remove “Chum” - either is fine.

Using base R indexing:

ggplot(data = all_data[!all_data$Species == "Pink", ]) + 
  geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N,  linetype = Species))

Using filter from dplyr:

ggplot(data = filter(all_data, Species != "Pink")) + # Note the != for "not equal to"
  geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N,  linetype = Species))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This works, but, it’s still rather hard to tell what’s going on with each species.

  • Try adding an argument to make the lines different colors.

































ggplot(data = filter(all_data, Species != "Pink")) +
  geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N,  linetype = Species, color = Species))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'



This looks reasonably good, but still a little busy. This may or may not be the best way to convey these data. For the next few things we try, let’s reduce out dataset down to just Coho, coastrange sculpin, and Steelhead.

data_red <- filter(all_data, Species %in% c("Coho", "coastrange sculpin", "Steelhead"))



We will often want to stack multiple geoms on top of each other. E.g., to fully visualize the trends and spread in the data, we might want to plot both the points and the line:

ggplot(data = data_red) + 
  geom_point(mapping = aes(x = Fork.length..cm., y = del15N)) +
  geom_smooth(mapping = aes(x = Fork.length..cm., y = del15N))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


If we have multiple layers in our plot that all share the same aesthetic, it can be easier to specify them once in the ggplot() call rather than idividually each time we call a new geom:

ggplot(data = data_red, mapping = aes(x = Fork.length..cm., y = del15N)) + 
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


We get exactly the same plot. The global mappings are applied to each geom when it is called. We can also extend or override the global mappings within each geom:

ggplot(data = data_red, mapping = aes(x = Fork.length..cm., y = del15N)) + 
  geom_point(mapping = aes(color = Species)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'



  • Make a plot that has both points colored by species, including separate colored lines for each species as well.

































ggplot(data = data_red, mapping = aes(x = Fork.length..cm., y = del15N, color = Species)) + 
  geom_point() +
  geom_smooth(mapping = aes(linetype = Species))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


Note that we can also add the data argument into a geom. Here we’ll plot the points for all species, but the line for only Coho:

ggplot(data = data_red, mapping = aes(x = Fork.length..cm., y = del15N, color = Species)) + 
  geom_point() +
  geom_smooth(data = filter(data_red, Species == "Coho"))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There are many, many ways to futher customize graphics in ggplot. If there is a type of plot you want to make, there is a good chance that ggplot will make it and you can usually find good examples with some internet searching. I turn to Google all the time to figure out how to make specific plots in ggplot.



Plots by discrete variables

We’ve made a bunch of scatter and line plots, let’s explore a few ways to plot out discrete variables.


We can make just a scatterplot, but it’s really not very informative.

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_point()



We can make a boxplot instead:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_boxplot()

This tells us a lot more information. We can also do things like swap the x and y coordinates:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_boxplot() +
  coord_flip()

Or fill the boxes with colors by species:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N, fill = Species)) + 
  geom_boxplot() +
  theme(legend.position="none")

  • What does theme(legend.position="none") in the above code do? What happens if we remove it?





There are lots of other things we could do with boxplots, but we’ll move on. My favorite way of visualizing distributions for discrete variables is using violin plots:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE)

This is fine, but not great. Change this plot so that the violins are filled with different colors for each species. Also get rid of any legend that shows up.

































ggplot(data = all_data, mapping = aes(x = Species, y = del15N, fill = Species)) + 
  geom_violin(trim=FALSE) +
  theme(legend.position="none")


If we want to, we can add boxplots within the violins. This is similar to adding the geom_smooth to the scatterplots above. Try making a violin plot with boxplots inside.

Make the violins filled by different colors for each species, but keep the boxplots unfilled. As a hint, the argument width controls the width of boxplots.

































ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  theme(legend.position="none")

This looks pretty fancy to me.



A couple final things:

We can change the axis labels and add a plot title if we want:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  theme(legend.position="none") +
  labs(title="Fish nitrogen isotopes", x = "Species", y = "delta 15 N")

And/or we can change the theme of the background:

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = "Species", y = "delta 15 N") +
  theme_minimal() +
  theme(legend.position="none")

  • Note that we had to move theme(legend.position="none") to after theme_minimal() because whichever comes last will override some of the parameters set by the one that is run first.

or

ggplot(data = all_data, mapping = aes(x = Species, y = del15N)) + 
  geom_violin(trim=FALSE, mapping = aes(fill = Species)) +
  geom_boxplot(width=0.1)+
  labs(x = "Species", y = "delta 15 N") +
  theme_classic() +
  theme(legend.position="none")





You customize each of these plots almost endlesslessly, and there are many other types of plots that you can create, as well. If you want to get an idea for some of the plotting diversity available, check out the the R graph gallery.